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Abstract 

Recent  advances  in  clustering  have  included  continuous  relaxations  of  the  Cheeger  cut  problem  and  those  which 
address  its  linear  approximation  using  the  graph  Laplacian.  In  this  paper,  we  show  how  to  use  the  graph  Laplacian  to 
solve  the  fully  nonlinear  Cheeger  cut  problem,  as  well  as  the  ratio  cut  optimization  task.  Both  problems  are  connected 
to  total  variation  minimization,  and  the  related  Ginzburg-Landau  functional  is  used  in  the  derivation  of  the  methods. 
The  graph  framework  discussed  in  this  paper  is  undirected.  The  resulting  algorithms  are  efficient  ways  to  cluster  the 
data  into  two  classes,  and  they  can  be  easily  extended  to  case  of  multiple  classes,  or  used  on  a  multiclass  data  set  via 
recursive  bipartitioning.  In  addition  to  showing  results  on  benchmark  data  sets,  we  also  show  an  application  of  the 
algorithm  to  hyperspectral  video  data. 

Keywords:  classification,  spectral  clustering,  Cheeger  cut,  ratio  cut,  graphs,  Ginzburg-Landau  functional,  total 
variation,  graph  Laplacian,  Nystrom  extension  method. 

1  Introduction 

1.1  General  Problem  and  Background 

Total  variation  has  become  increasingly  more  central  to  inverse  problems  in  imaging.  The  classic  paper  by  Rudin, 
Osher  and  Fatemi  [32]  sparked  interest  in  total  variation  by  applying  it  very  successfully  to  denoising  problems.  Later, 
similar  techniques  were  applied  to  areas  such  as  inpainting,  image  restoration  and  blind  deconvolution.  For  example, 
in  [11],  Chan  et  al.  present  a  blind  deconvolution  method  using  total  variation  (TV),  in  which  they  use  an  alternating 
minimization  implicit  iterative  scheme.  In  addition,  while  [9]  proposes  a  nonlinear  primal-dual  method  for  TV-based 
image  restoration,  [10]  formulates  a  higher  order  TV-based  restoration  algorithm. 
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More  recently,  total  variation  has  been  used  successfully  in  a  graphical  framework.  In  [20],  the  authors  outline  the 
graphical  framework,  and  apply  it  to  image  inpainting  and  regularization  using  the  total  variation  functional.  Another 
example  is  [27],  in  which  Lou  et  al.  describe  a  nonlocal  method  for  image  recovery  using  total  variation.  In  [19], 
Gilboa  and  Osher  not  only  formulate  a  nonlocal  TV  method  for  image  regularization,  but  also  extend  it  to  supervised 
segmentation,  where  the  user  marks  small  regions  of  known  information,  and  the  desired  class  is  recovered.  Since 
total  variation  is  becoming  more  popular  in  machine  learning  problems  in  which  data  is  represented  in  a  graphical 
framework,  the  minimum  cut  problem,  which  becomes  minimizing  total  variation  of  an  indicator  function  in  a  graph 
setting,  becomes  a  natural  mechanism  with  which  to  proceed  to  achieve  accurate  data  classification. 

We  thus  consider  the  problem  of  minimizing  the  cut  as  a  way  of  partitioning  the  set  X  into  two  clusters.  Let 
G  —  (V,  E)  represent  an  undirected  graph  with  the  set  of  vertices  V  and  set  of  edges  E ,  and  consider  a  target  set  X  of 
size  n  embedded  in  a  graph  G.  A  weight  function  is  defined  on  the  set  of  edges,  and  represents  a  measure  of  similarity 
between  the  vertices  it  is  connecting.  As  usual,  the  degree  of  a  vertex  x  is  formulated  as  d(x )  =  J2yeV  w(x,  y). 
Denote  by  D  the  diagonal  matrix  containing  the  degrees  of  vertices  as  diagonal  entries,  and  let  W  denote  the  similarity 
matrix  containing  the  weight  function  values.  The  minimum  cut  problem  is  to  find  the  set  S  C  V  such  that  the 
following  value  is  minimized: 

cut{S ,  S)  =  E  w(x,y). 

xes,yes 

Here,  S  indicates  the  complement.  Intuitively,  the  weights  between  vertices  of  different  clusters  should  be  small. 
However,  minimizing  the  cut  leads  to  an  undesirable  solution,  first  and  foremost  because  the  trivial  solution  would  be 
to  cluster  all  the  points  into  one  set,  thus  giving  the  cut  the  value  of  zero.  Even  if  a  non-trivial  partition  condition  is 
enforced,  the  solution  tends  to  isolate  individual  vertices  from  the  rest  of  the  graph.  Usually,  what  is  wanted  is  for  the 
two  clusters  to  be  relatively  close  in  size,  and  one  solution  is  to  include  a  normalization  of  the  cut. 

One  normalization  is  the  ratio  cut.  The  problem  is  to  find  a  subset  S  of  V  such  that 

RatioCut(S ,  S)  =  cut(S ,  S)  ( yyr  + 

V|6| 

is  minimized.  This  is  a  NP  hard  discrete  problem  [38].  One  way  to  simplify  it  would  be  to  allow  the  solution  to  take 
arbitrary  values  in  M.  A  clever  reformulation  of  the  original  task  and  relaxation  of  the  binary  constraint  leads  to  the 
problem  of  finding  the  argument  of  the  following  optimization  task: 


min  (u,  L u) ,  such  that  u  _L  1, 

ueRn 


(1) 


where  L  is  the  graph  Laplacian  D  —  W.  We  emphasize  the  fact  that  the  above  problem  obtains  a  real- valued  solution 
instead  of  a  discrete- valued  one.  To  solve  (1.1),  one  can  apply  the  Raleigh-Ritz  theorem,  and  the  solution  is  given  by 
the  second  eigenvector  of  the  graph  Laplacian.  To  obtain  a  binary  solution,  which  indicates  the  binary  partition,  one 
can  use  several  methods,  the  simplest  of  which  is  thresholding. 

Another  normalization  of  the  cut  is  the  normalized  cut.  If  we  let  vol(S)  =  J2xes  where  d(x)  represents 
the  degree  of  vertex  x,  the  problem  is  modified  to  find  a  subset  S  of  V  such  that 

i Vc„t(RS)  =  c„t(S,S)(^  +  ^) 
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is  minimized.  This  is  a  NP  hard  discrete  problem  [38]  as  well.  We  simplify  it  by  allowing  the  solution  to  take  arbitrary 
values  in  R.  A  reformulation  of  the  normalized  cut  problem  and  relaxation  of  the  binary  constraint  lead  to  the  following 
task  of  finding  the  argument  of  the  minimum  of  the  task: 


min(iz,  Lm),  such  that  Du  _L  1 ,  (u,  Du)  =  vol(V), 

u£Rn 


where  vol(V )  =  ^2xd(x).  One  can  again  apply  the  Raleigh-Ritz  theorem,  and  the  solution  is  given  by  the  sec¬ 
ond  eigenvector  of  the  random  walk  Laplacian  D-1L.  Thresholding  can  be  used  to  obtain  a  binary  solution  which 
represents  the  desired  partition. 

The  last  normalization  to  be  discussed  is  the  Cheeger  cut ,  which  involves  finding  a  subset  S  of  V  that  minimizes 


h(S,  S ) 


cut(S ,  S) 
min(\S\,  |5|) 


If  volume  of  a  set  is  used  instead,  one  obtains  the  normalization 


hv(S,S)  = 


cut(S ,  S) 


min(vol(S) ,  vol(S))  ’ 

which  is  also  known  as  conductance.  Let  (j)(G)  =  minsey  hv(S,  S).  The  Cheeger  cut  and  the  second  eigenvalue  A2 
of  the  matrix  D-  i  LD_  i  can  be  related  via  the  following  Cheeger  inequality  [12]: 

HG)2 


^  A2  ^  2 cf)(G). 


(2) 


The  authors  of  [18]  present  a  generalized  version  of  it. 

We  have  seen  that  the  ratio  cut  and  the  normalized  cut  problems  can  be  formulated  in  a  relaxed  setting,  with  non¬ 
binary  solutions  given  by  the  second  eigenvectors  of  the  Laplacian  and  the  random  walk  Laplacian,  respectively,  with 
the  binary  answer  obtained  by  thresholding.  However,  for  an  arbitrary  number  of  classes,  this  technique  is  too  simple. 
The  method  of  spectral  clustering  computes  the  k  clusters  using  the  steps: 

1.  Formulate  the  data  set  in  a  graph  setting. 

2.  Compute  either  the  Laplacian  (L  =  D  —  W)  or  the  random  walk  Laplacian  (Lrw  =  D_1L). 

3.  Compute  the  first  k  eigenvectors  Vk }  of  the  Laplacian  (or  the  random  walk  Laplacian). 

4.  Let  U  be  the  matrix  containing  the  vectors  [v\,  ^2,...,  Vk }  as  columns. 

5.  Cluster  the  rows  of  the  matrix  U  with  the  fc-means  algorithm  into  k  clusters. 


Note  that  the  k-means  algorithm  for  finding  k  clusters  proceeds  iteratively  by  first  choosing  k  means  and  then 
assigning  each  point  to  the  mean  that  is  the  “closest”  to  the  point.  The  mean  of  each  cluster  is  then  recalculated.  The 
iterations  continue  until  there  is  little  change  from  one  iteration  to  the  next. 

There  is  a  large  literature  on  the  binary  partitioning  problem  on  graphs.  Here,  we  review  work  that  relates  to 
solving  the  cut  problem  under  some  normalization  condition.  In  [8],  the  authors  present  a  generalized  version  of 
spectral  clustering  using  the  graph  p-Laplacian.  They  show  that,  as  p  1,  the  cut  resulting  from  thresholding  the 
second  eigenvector  of  the  graph  p-Laplacian  is  the  solution  to  the  Cheeger  cut  problem.  An  efficient  scheme  to 
calculate  the  eigenvector  is  then  introduced.  In  [23],  Hein  et  al.  show  that  some  constrained  optimization  problems 
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can  be  formulated  as  nonlinear  eigenproblems.  The  authors  then  describe  a  generalization  of  the  inverse  power  method 
which  converges  to  nonlinear  eigenvectors.  This  method  is  applied  to  spectral  clustering  and  sparse  PC  A.  Moreover, 
in  [13],  Chung  describes  the  heat  kernel  as  the  pagerank  of  a  graph,  and  uses  it  and  the  local  Cheeger  cut  inequality  to 
establish  a  fast  graph  partitioning  algorithm.  In  [33],  Spielman  et  al.  present  nearly-linear  time  algorithms  for  graph 
partitioning,  graph  sparsification,  and  solving  linear  systems. 

Recent  work  by  Bresson,  Szlam,  Laurent,  von  Brecht,  et  al.  includes  several  important  papers  related  to  clustering. 
In  [35],  Szlam  et  al.  present  a  relaxation  of  the  Cheeger  cut  problem  and  then  show  similarities  of  the  energy  of  the 
relaxed  problem  and  well  studied  energies  in  image  processing.  An  algorithm  based  on  the  split-Bregman  method  [21] 
is  then  developed  to  minimize  the  proposed  energy.  Authors  of  [3]  describe  two  procedures  solving  the  relaxed 
Cheeger  cut  problem.  The  first  algorithm  is  a  novel  steepest  descent  approach,  and  the  second  one  is  a  modified 
inverse  power  method.  Some  convergence  results  are  also  given.  In  [5],  Bresson  et  al.  develop  another  version  of  the 
method  shown  in  [3]  using  a  new  adaptive  stopping  condition.  The  result  is  an  algorithm  that  is  monotonic  and  much 
more  efficient  than  before.  Note  that  the  mentioned  methods  are  unsupervised,  which  do  not  incorporate  any  known 
data.  As  opposed  to  these  kinds  of  algorithms,  semi-supervised  optimization  approaches  involve  the  inclusion  of  a 
fidelity  term,  a  fitting  term  to  known  data. 

Multiclass  extensions  relating  to  the  cut  have  also  been  proposed.  One  approach  is  to  use  recursion  and  thus 
solve  a  collection  of  binary  problems.  However,  other  ideas  have  also  been  introduced.  The  authors  of  [4]  present  a 
framework  for  multiclass  total  variation  clustering  that  does  not  use  recursion.  They  formulate  the  Cheeger  energy  in  a 
multiclass  setting,  and  then  relax  the  energy  in  a  continuous  setting.  This  results  in  an  optimization  problem  involving 
total  variation,  which  is  then  solved  using  the  proposed  proximal  splitting  algorithm.  In  [6],  a  multiclass  extension  of 
the  result  of  [35]  is  derived.  The  method  deals  with  learning  several  classes  simultaneously  with  a  set  of  labels,  and 
is  made  even  more  efficient  by  the  usage  of  fast  L 1  solvers,  designed  for  the  total  variation  semi-norm.  Finally,  we 
mention  the  work  of  Hu  et  al.  [25]  in  which  the  authors  formulate  modularity  optimization  as  a  minimization  problem 
using  an  energy  functional  that  contains  a  total  variation  term. 

In  addition  to  using  the  cut  as  a  basis  for  our  method,  one  technique  used  in  this  work  is  one  of  spectral  methods. 
Spectral  methods  are  a  type  of  approach  used  to  solve  differential  equations.  Their  idea  is  to  write  the  solution  of  the 
differential  equation,  and  perhaps  other  relevant  parts  of  the  problem,  as  a  sum  of  basis  functions.  The  coefficients  of 
the  sum  are  then  calculated  so  that  the  differential  equation  is  satisfied  (or  satisfied  to  the  best  ability).  In  this  work, 
we  use  the  eigenfunctions  of  a  Laplacian  as  basis  functions. 

1.2  Our  Contributions 

Specifically,  our  contributions  are: 

1 .  We  show  how  to  use  the  graph  Laplacian  to  solve  the  fully  nonlinear  ratio  cut  problem.  We  call  this  algorithm  the 
Modified  Ratio  Cut  Method  (RCCM),  and  present  it  in  Section  3.  Two  approaches  are  presented;  one  approach 
links  a  modified  ratio  cut  formulation  to  an  optimization  problem  involving  total  variation,  and  the  other  one 
deals  with  solving  the  ratio  cut  formulation  directly. 
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2.  We  show  how  to  use  the  graph  Laplacian  to  solve  the  fully  nonlinear  Cheeger  cut  problem.  We  call  this  algorithm 
the  Modified  Cheeger  Cut  Method  (MCCM),  and  present  it  in  Section  4. 

3.  A  main  advantage  of  these  methods  is  efficiency,  while  being  able  to  maintain  the  accuracy  of  that  of  the  state-of- 
the-art.  The  efficiency  is  partly  achieved  through  the  use  of  the  Ny strom  extension  method  [15, 16].  The  methods 
can  be  easily  extended  to  case  of  multiple  classes,  or  used  on  a  multiclass  data  set  via  recursive  bipartitioning. 

2  Graphical  Framework 

2.1  Fundamentals 

In  this  paper,  we  consider  a  graphical  framework  with  an  undirected  graph  G  =  ( V,E ),  where  V  is  the  set 
of  vertices  and  E  is  the  set  of  edges.  Earlier  in  Section  1,  we  introduced  some  important  concepts  related  to  the 
framework,  such  as  the  weight  function,  the  degree  and  the  similarity  matrix.  One  advantage  of  using  a  graphical 
framework  is  that  it  allows  one  to  be  non-local  and  take  into  account  the  relationship  between  any  two  nodes  in  the 
data  set.  In  the  case  of  image  processing,  this  makes  it  easier  for  one  to  capture  repetitive  structure  and  texture,  as 
shown  in  [29] .  Furthermore,  the  framework  is  also  very  general,  and  can  be  easily  applied  to  any  data  set. 

It  is  possible  to  introduce  some  common  mathematical  operators  in  a  graphical  setting.  In  this  section,  we  will 
only  be  concerned  with  the  graph  version  of  the  differential  Laplace  (A)  operator.  Although  many  versions  exist,  three 
popular  ones  are  the  unnormalized  Laplacian  L  =  D  —  W,  symmetric  Laplacian  Ls  =  ,  and  the  random 

walk  Laplacian  Lrw  =  D_1L.  The  graph  Laplacians  have  the  following  easily  shown  equations: 

1)  L  and  Ls  are  symmetric. 

2)  L,  Ls  and  Lrw  are  positive  semi-definite. 

3)  L,  Ls  and  Lrw  have  non-negative,  real- valued  eigenvalues  0  =  Ai  <  A2  <  ...  <  An. 

4)  The  smallest  eigenvalue  of  L  is  0;  the  corresponding  eigenvector  is  just  a  constant  vector. 

5)  A  is  an  eigenvalue  of  Lrw  with  eigenvector  u  if  and  only  if  A  is  an  eigenvalue  of  Ls  with  eigenvector  w  =  D^u. 
Note  that  the  multiplicity  of  eigenvalue  0  equals  the  number  of  connected  components  in  the  graph. 

2.2  Graphical  Framework,  Extended 

We  outline  the  graphical  framework  in  more  detail  here,  giving  general  definitions  of  other  operators.  The  operators 
are  defined  similarly  to  [22, 37],  where  the  justification  for  these  choices  is  explained. 

Let  n  be  the  number  of  vertices  in  the  graph  and  let  V  =  Mn  and  £  =  M  ( 2  }  be  Hilbert  spaces  (associated  with 
the  set  of  vertices  and  edges,  respectively)  defined  via  the  inner  products: 

(«,7>v  =  ^2u(x)'Y(x)d(x)r, 

X 

x,y 

X 
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Viz,  7  E  V  and  V0,  4>  E  £ ,  for  some  r  E  [0, 1]  and  q  E  [|,  1].  Let  us  also  define: 


IMv  =  V(w>w)v  =  ^^u(a;)2d(a;)’', 

ll^llf  =  vEwE  =  M^)29-j, 

V  ®>2/ 

||w||£2  =  =  ^^w(x)2, 

ll^llf.oo  =  max|</>(x,y)|. 

x,y 

The  gradient  operator  V  :  V  — )>  £  is  formulated  as: 

(Vii)ffi(x,  y)  =  w(x,  y)1~q(u(y)  -  u(x)). 

The  Dirichlet  energy  does  not  include  parameters  r  or  q\ 

\  II Vu|| £  =  t  ^  y)(u(x)  “  w(2/))2- 


®,2/ 


The  divergence  divw  :  £  — )>  V  definition  is  based  on  the  adjoint  of  the  gradient: 

(div™  0)(a;)  =  — f-  55  w(a  v)9{4>{x,  y)  -  0(y,  a:)), 

V  ^  y 

where  we  define  the  adjoint  using  the  following  definition:  (Viz,  0)£  =  —  (iz,  divw  </>)y. 
We  now  have  a  set  of  graph  Laplacians  Ar  =  divw  V  :  V  —>  V: 

(A wu)(x)  =  55  “  u(x))- 

In  matrix  form,  we  can  write  the  set  of  Laplacians  as 

—A,,,  =  D1_r  -  D_rW. 


(3) 


(4) 


(5) 


(6) 


(7) 


The  matrices  resulting  from  r  =  0  and  r  =  1  are  called  the  unnormalized  Laplacian  and  the  random  walk 
Laplacian,  respectively.  Using  divergence,  a  set  of  anisotropic  total  variations  TVW  :  V  — >  R.  can  now  be  formulated: 

TVw,q(u)  =  max  { (divm  <f>,  u)y  :(pe£,  ||^>||fj00  <  l}  =  ^^w{x,y)q\u{x)  -  u(y)\.  (8) 

x.y 

The  last  operator  to  be  defined  will  involve  the  Ginzburg-Landau  functional  that  our  methods  are  based  upon: 

GLe(u)  =  \\Vu\\l  +  -£J2  W(«(*)),  (9) 

X 

where  W  is  the  double  well  potential  W(u)  =  u2(u  —  l)2.  Note  that,  while  the  first  term  in  the  continuous  Ginzburg- 
Landau  functional 

GLc(u)  =  e  f  \Wu\2dx+  *  f  W(u)dx 
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is  scaled  by  e,  the  first  term  of  GLe  does  not  include  e.  This  is  because,  unlike  the  case  with  the  continuous  functional, 
the  difference  terms  of  GLe  are  finite  even  for  binary  functions,  and  no  rescaling  of  the  first  term  is  necessary. 

In  relation  to  parameters,  we  choose  q  =  1  since  in  [37],  it  is  shown  that  for  any  r  and  q  =  1,  TVW  is  the  T-limit 
(Gamma  convergence)  of  a  sequence  of  graph-based  Ginzburg-Landau  (GL)-type  functionals: 

p 

Theorem  1.  GLe  — >  GLq  as  e  0,  where 

GLe(u)  =  ||V«|||  +  1  y]W(w(x))  =  Iy^w(x,y)(w(a ;)  -  u(y))2  +  1  y^W(«(x)), 

a:  cc 

{TT4;  lM  /or  j.r.  tiLc)  G  {0,  1}, 
oo  otherwise. 

Proof.  See  Theorem  3.1  of  [37].  ■ 

Thus,  to  be  consistent  with  their  work,  use  q  —  1.  The  parameter  r  is  allowed  to  change,  since  we  experiment  with 
different  Laplacians.  With  the  parameters  so  defined,  total  variation  is  now  formulated  as 

TVw(u )  =  max  { (div„,  <fi,  u)\>  :<f>e£,  \\4>\\£t00  <  l}  =  ^  E  w(x,  y)\u(x)  -  u(y)\.  (10) 

x.y 

This  is  the  definition  used  in  the  paper. 


3  Modified  Ratio  Cut  Method 

3.1  Derivation 

Let  us  consider  the  binary  partitioning  problem  involving  the  target  set  X  (of  size  n)  embedded  in  a  graphical 
framework.  One  way  to  classify  the  set  into  two  classes  would  be  to  find  the  partition  that  minimizes  some  normaliza¬ 
tion  of  the  cut.  In  this  section,  our  derivations  are  based  on 

RatioCut(S ,  S)  =  cut(S ,  S)  f  7^7  + 

V|6| 

The  ratio  cut  is  related  to  the  Cheeger  cut  h(S,  S)  via  the  following  inequality: 

h(S,S)  <  RatioCut(S,S )  <  2 h(S,S), 

which  can  be  easily  shown  using  the  definitions.  We  now  present  two  different  approaches  to  the  ratio  cut  problem. 

3.1.1  Approach  I 

In  this  section,  we  do  not  solve  the  exact  ratio  cut  formulation,  but  instead  focus  on  a  slightly  modified  version. 
Earlier,  we  discussed  the  general  problem  of  minimizing  the  cut  to  find  a  desired  partition: 

min  ^ ' cut (S ,  S)  =  E  w{x,  y)j . 

xes,yes 
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Since  minimizing  the  cut  is  equivalent  to  minimizing  its  square,  we  can  consider  that  slightly  modified  problem. 
However,  some  sort  of  normalization  is  needed  to  create  a  bias  towards  partitions  where  the  two  clusters  are  of  similar 
size.  Using  the  regular  ratio  cut  normalization,  one  obtains  the  modified  ratio  cut  problem: 

mi„<mt(S,S)2(+  +  i),  (11) 

which  is  equivalent  to 

(12) 

Before  giving  more  details,  we  note  that  if  xs  '  G  -»  {0, 1}  is  the  indicator  function  of  a  subset  S  of  G,  then 

TVw(Xs)  =  cut(S,S).  (13) 


In  addition,  if  the  mean  of  a  function  u  :  V  — ¥  R  is  defined  as  -  J2iev  we  a^so  have 


\\XS  ~mean(xs) \\%  =  {xs(xi)  ~  — )  =lS'l(1_~)  +  l5l(^-) 


5I\2  I5II5I 


151  + 151’ 


where  x,  represents  node  i. 

Therefore,  due  to  (13)  and  (14),  problem  (11)  (and  (12))  is  equivalent  to  the  following  problem: 


T  Vw(u) 

mm  - . 

x(fc)e{o,i}  \\u  —  mean(u)\\C2 


This  is  an  NP-hard  problem,  and  one  way  to  simplify  it  would  be  to  relax  the  binary  condition  to  include  a  larger  set 
of  functions;  in  our  case,  we  minimize  over  all  functions  u  :  V  M  with  values  in  M.  We  now  show  that  there  exists 
a  binary  minimizer  of  the  relaxed  problem,  and  it  represents  the  solution  to  the  modified  ratio  cut  problem. 


(15) 


(16) 


Proof.  The  techniques  of  the  proof  are  similar  to  the  ones  in  [35].  Suppose  that  u*  is  the  minimizer  of  (15). 
We  have  TVw(u*)  ^  0,  otherwise  u*  would  be  constant,  and  the  ratio  in  (15)  would  be  undefined.  Note  that  the 
functional  in  (15)  is  scale  invariant;  we  can  thus  rescale  u*  so  that  TVw(u*)= 1.  In  this  case,  u*  is  a  maximizer  of 
the  denominator,  constrained  to  the  set  of  functions  u  such  that  TVw(u )  <  1.  Let  A\  be  the  set  of  indices  where  u* 
is  non-positive,  and  A2  be  the  set  where  it  is  positive.  Define  Z  to  be  the  set  of  functions  that  are  non-positive  on 
the  first  set  and  non-negative  on  the  second  one,  and  C  to  be  the  TV  norm  unit  ball  on  Z.  Clearly,  u*  is  a  solution  to 
max  || u  —  mean(u)\\C2-  Since  the  set  C  is  convex,  and  E(u)  =  || u  —  mean(u) ||£2  is  a  convex  functional  on  C,  a 
maximum  is  also  attained  on  at  least  one  of  the  extreme  points.  By  Lemma  2.2  of  [35],  any  extreme  point  of  C  takes  the 
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form  of  a  (scaled)  indicator  function  of  some  set  S*.  Therefore,  there  exists  an  extreme  point  v  =  a\s *  of  C,  which  is 
also  a  solution  to  max  \  \u  —  mean (u)  \\c2 .  Note  that  due  to  the  binary  property  of  v,  and  since  || u*  —  mean(u*)  ||£2  = 
\\v  —  mean (v)  ||£2  and  TVw(v)  <  TVw(u*)  =  1, 


mm 


n  ||u,  —  mean(u)  ||£2 


=  A  = 


TVw(u*) 


> 


TVw(v) 


\\u*  —  mean(u*)  ||£2  ||t>  —  mean(u)||£2 


=  cirf(S*,  5*)  ^  ^  >  min  cut(S,  S)  J  ^  +  p.  (17) 


Since 


min  cut(S,S)JJ]-  +  -lr  = 


Scv 

we  get  equality  in  (17).  ■ 

Therefore,  we  consider  the  problem 


mm 


Scv 


|S| 


|5|  \S\  u:u(x)e{o,i}  \\u  —  mean(u)\\C2 


>  min  - 


TVw(u) 


\\u  —  meamu 


II  £2 


mm  - 


TVw(u) 


(18) 


I  £2 


u  il  u  —  mean{u) 

Due  to  the  relationship  shown  in  Theorem  1 ,  the  idea  now  is  to  interchange  total  variation,  which  is  directly  related 
to  the  cut,  with  the  Ginzburg-Landau  functional.  This  will  allow  for  “cleaner”  minimization  techniques  due  to  the 
L 2  term  of  the  functional,  and  will  later  allow  us  to  express  all  the  calculations  via  the  Laplacian,  for  which  we  use 
a  spectral  approach.  We  later  introduce  the  use  of  the  Ny strom  extension  method  [15, 16]  -  a  clever  technique  that 
approximates  the  eigenvectors  of  a  Laplacian  using  only  a  small  portion  of  the  graph  weights.  This  technique  is  very 
efficient,  and  especially  useful  for  very  big  data  sets.  Replacing  the  total  variation  in  (18)  by  the  Ginzburg-Landau 
functional,  we  obtain  the  problem  of 

GLe(u ) 


minF('u)  =  min 


(19) 


u  \\u  —  mean(u)\\ C2 

We  denote  the  function  by  which  GLe(u)  is  multiplied  by  C\{u). 

To  solve  (19),  we  compute  the  modified  Allen-Cahn  equation  using  the  £2-gradient  flow  associated  with  (19). 
Since 

—F(u  +  tv)  =  (  —  GLe{u  +  tv))(Ci{u  +  tv))  +  GLe(u  +  tv)(  —  Ci(u  +  tv)), 

evaluating  the  derivative  at  zero  gives 

^F(u  +  tv)  |t=0  =  -2(Ci(u)Awu,v)c2  +  i(C'i(u)W,(u),  v)c^  +  (C2{u)GLe(u),v)c  2, 


where 


and 


C{u) 


\\u  —  meamu) 


C2(u)  =  - 


u  —  mean{u) 


||  u  —  mean(u)\\C2 


We  obtain  the  modified  Allen-Cahn  equation: 

du{pc) 


dt 


=  2C1(u)(Awu)(x)  -  CVVWHx))  _  C2(u)GLe(u)j 


(20) 

(21) 
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or  equivalently 


du(x) 

dt 


2C1(u)(Awu)(x) 


Ci(u)W'  (u(x)) 
e 


We  will  describe  how  we  solve  it  shortly. 


(22) 


3.1.2  Approach  II 


In  this  section,  we  consider  the  exact  ratio  cut  problem.  Let  u  be  a  binary  function  (taking  values  0  or  1)  denoting 
the  class  of  each  of  the  nodes.  Then  one  can  write  the  ratio  cut  problem  as  the  task 


min  TVv 

u:u(fc)E{0,l} 


+  n- T  _u(x))' 


(23) 


'Exu(x)  n~Exu(x)' 

Similarly  to  the  procedure  of  the  previous  section,  we  relax  the  binary  constraint  and  replace  the  total  variation  term 
in  (23)  by  the  GL  functional,  with  motivation  described  in  Section  3.1.1,  so  the  following  problem  is  formulated: 

1 


min  F(u)  =  GLe(u)(- 

u  \ 


4 — v 

G  u  (x) ) 


(24) 


kEx  u(x) 

Let  us  denote  the  second  term  in  the  product  by  C\  (u). 

To  solve  (24),  we  compute  the  modified  Allen-Cahn  equation  using  the  £2-gradient  flow  associated  with  (24). 
Since 

—F(u  +  tv)  =  (  —  GLe{u  +  tv))(Ci{u  +  tv))  +  GLe(u  +  tv)(  —  Ci(u  +  tv)), 

evaluating  the  derivative  at  zero  gives 

^F(u  +  tv)\t=o  =  —2{Ci(u)Awu,  v) cp-  +  ^(Ci(u)W (u),v) &  +  {C2(u)GL(u) ,  v) & , 


where 


and 


C^u) 


C2(u)  = 


+ 


Exu(x)  n~Exu(x) 


-l 


+ 


(Exu(x))2  (n~Exu(x))2' 

Again,  we  obtain  the  modified  Allen-Cahn  equation 

9u(x)  S.a/a  Ci{u)W{u(x))  n  J|2  ^WEi^W) 


dt 


=  2Ci(u)(Awu)(x)  - 


-C2(u)\\\7u\¥£- 


(25) 


(26) 


(27) 


3.2  The  Procedure  of  Modified  Ratio  Cut  Method  (MRCM) 

We  now  present  two  different  versions  of  the  algorithm  solving  (22)  (or  (27)).  The  first  version  computes  the 
gradient  term  of  the  Ginzburg-Landau  functional  explicitly  using  the  graph  weights  via  equation  (4).  The  second 
version  uses 

||Vu|||^-(Au,u)v  (28) 

and  spectral  composition  of  the  Laplacian  to  compute  the  Dirichlet’s  energy.  Since  the  Laplacian  is  the  only  other 
operator  in  (27),  one  can  use  the  Ny strom  extension  method  to  efficiently  approximate  the  eigenvectors  that  will  then 
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be  used  in  (28)  to  compute  the  Dirichlet’s  energy.  Only  a  small  portion  of  the  graph  weights  have  to  be  computed  in  the 
method’s  procedure,  which  results  in  more  efficiency,  and  this  is  the  advantage  of  the  second  version  of  the  algorithm. 
In  fact,  for  most  data  sets,  we  only  needed  to  compute  a  very  small  fraction  of  the  weights,  around  0.3  percent. 


3.2.1  Version  1 

Discretizing  (22)  and  calculating  the  Laplacian  implicitly,  one  obtains 

=  2Ci(un)(Awun+1)i  -  C.("")W'W)_C2(„.)||V„„.|||  OAu-)  E,  WW) 


u?+1  -  u? 


dt 


(29) 


To  solve  the  equation,  we  use  the  eigendecomposition  of  a  Laplacian  and  write  terms  in  the  above  equation  as  series 
involving  eigenvectors.  Let 

un  =  'Y^al4>k(x),  (30) 

k 

-  (£-w("-))a(t‘,,)  «£•«><*>.  <3D 

k 

where  {(j)k(x)}k  are  the  eigenfunctions  of  a  Laplacian.  Here,  we  calculate  the  Dirichlet’s  energy  term  explicitly  using 
(4).  After  plugging(30)  and  (31)  into  (29),  one  obtains 


a£+1  = 


al  +  dtbl 


1  +  2dtCi(un)Xk  ’ 


(32) 


where  is  the  kth  eigenvalue  of  a  Laplacian. 


3.2.2  Version  2 

Discretizing  (22)  and  calculating  the  Laplacian  implicitly,  one  obtains 

=  2Ci(un)(Awun+1)j  - 


u"+1  -  u? 


(33) 


dt  ...  '  -  e  e 

To  solve  the  above  equation,  we  use  the  eigendecomposition  of  a  Laplacian.  Let  un  =  Yh k  where  {c t>k{x)}k 

are  the  eigenfunctions  of  a  Laplacian.  Using  (28)  and  the  orthonormal  property  of  eigenvectors,  one  can  derive  the 
following  equation  involving  the  1 1 X7wun\  ||  term: 


||VwW”||l  =  ^Afc(^)2, 


(34) 


where  are  the  eigenvalues  of  a  Laplacian. 

Therefore,  when  calculating 

2  C^w^)  c2(un)  EiMO 

)\\ VWU  \\s - - - - - 

to  find  bn  (see  (31)),  we  use  (34).  Thus,  in  this  version,  the  coefficients  b2  are  defined  as: 


-C2(u")Y,M<)2  -  Cl("")W'(“',)  -  C2(“',)£‘tV(“?>)  =  (35) 


ll 


where  we  have  used  (34)  to  replace  the  Dirichlet’s  energy  term.  After  plugging  these  representations  into  (33),  one 
obtains 


a?+1  = 


dtbu 


l  +  2dtC1(un)\k’ 

where  Xk  is  the  kth  eigenvalue  of  a  Laplacian.  Below,  we  outline  the  steps  of  both  versions  of  the  method: 


(36) 


Modified  Ratio  Cut  Method  (MRCM) 

Here  u  will  be  a  function  whose  thresholded  value  indicates  the  class  membership.  The  steps  of  the  algorithm  are: 

*  Organize  the  data  set  in  a  graphical  setting,  compute  eigenvalues  and  eigenvectors  of  a  chosen  Laplacian. 

*  For  Approach  I,  calculate  C\  and  C2  via  (20)  and  (21),  respectively.  For  Approach  II,  calculate  C\  and  C2  via 
(25)  and  (26),  respectively. 

*  Initialize  u°,  calculate  a0  using  (30),  calculate  b°  using  (31)  (for  Version  1)  and  (35)  (for  Version  2). 

*  Repeat  the  following  steps  from  n  =  0  until  a  stopping  criterion  is  satisfied: 

-  For  Version  1,  calculate  an+1  using  (32).  For  Version  2,  calculate  an+1  using  (36). 

-  For  Version  1,  calculate  bn+1  using  (31).  For  Version  2,  calculate  bn+1  using  (35) 

-  Calculate  un+1  using  (30). 

*  Threshold  the  solution  to  obtain  a  binary  answer,  which  represents  the  assigned  class. 


4  Modified  Cheeger  Cut  Method 


4.1  Derivation 


We  again  consider  the  binary  partitioning  problem  involving  the  target  set  X  embedded  in  a  graphical  framework. 
We  use  similar  techniques  of  the  previous  sections,  but  using  the  Cheeger  cut.  As  described  in  Section  1,  the  Cheeger 
cut  problem  can  be  formulated  as  finding  the  subset  S  of  X  such  that  the  following  value  is  minimized: 


h(S,  S)  = 


cut(S ,  S) 
min(\S\,  |5|) 


If  u  is  a  binary  function  (taking  values  0  or  1)  denoting  the  class  of  each  of  the  nodes,  one  can  write  the  Cheeger 
problem  as  the  task 

min  TVW  ( u )  ( - 7— - — — - — - — —  ) . 

{u:uO)e{0,i}}  \min[}2x  u(x),n  -  u(x))  ' 

Unfortunately,  this  is  an  NP  hard  problem,  and  one  approach  is  to  relaxation  of  constraints.  We  relax  the  problem  by 
minimizing  over  functions  with  values  in  M: 


min  TV,, 


.(«)( 


min[y2,x  u(x),n  -  J2XU(X)) 


)■ 


(37) 
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Similarly  to  the  procedure  of  the  previous  section,  we  replace  the  total  variation  term  in  (23)  by  the  GL  functional, 
with  motivation  described  in  Section  3.1.1,  so  that  the  above  problem  is  formulated  as: 


lin  F(u)  =  GLe(u)  (  ,  — — 1 - — — —r). 

u  \min[}^xu{x),n  -  }^xu(x)) ' 


(38) 


■£*«(*))• 

We  denote  the  second  term  in  the  product  by  B\  (u). 

To  solve  (38),  we  compute  the  modified  Allen-Cahn  equation  using  the  £2-gradient  flow  associated  with  GL 
functional.  Since 

—F(u  +  tv)  =  (— GLe(u  +  tv))(Bi(u  +  tv))  +  GLe{u  +  tv)(  —  Bi(u  +  tv))f 
evaluating  the  derivative  at  zero  gives 

^tF(u  +  tv)\t=o  =  -‘2{B1(u)Awu,v)/:2  +  ^(Bi(u)W(u),v)cz  +  {B2(u)GL(u),  v)&, 


where 


and 


filW  = 


‘"(Ex  “(aO.n-E* 


mm 


s2(«)  =  4 


if(Ex«W)2>(»»-Ex«W)5 

( (s,«l))2  if(£*m))2  <  (n~Exu(V¥ 


The  modified  Allen-Cahn  equation  is  thus: 

du{pc) 


dt 


=  2B1(u){Awu)(x)  -  B^U)W«X))  _  B2(u)GLe(u) 


or  equivalently 

du(x) 


=  2Bi(u)(Awu)(x)  -  B‘(“)W'("(I>>  -  B2(»)l|V„u|||  -  m 


dt  e 

In  the  next  section,  we  describe  how  we  solve  this  equation. 


4.2  The  Procedure  of  the  Modified  Cheeger  Cut  Method 

We  present  two  different  versions  of  the  modified  Cheeger  cut  method.  As  mentioned  in  the  previous  section, 
Version  2  calculates  the  Dirichlet’s  energy  term  of  the  Ginzburg-Landau  functional  using  a  different  approach  than  the 
explicit  graph  formula  (4).  This  allows  one  to  use  the  Ny strom  extension  method,  due  to  the  Laplace  operator,  and 
thus  only  a  small  portion  of  the  graph  weights  need  to  be  calculated.  Thus,  this  version  is  more  efficient. 


4.2.1  Version  1 

Discretizing  (39)  and  calculating  the  Laplacian  implicitly,  one  obtains 

=  2B1(un)(Awun+1)i  -  B2(un)\\Vwun\\2£  - 


vA+1  -  u r 


»+lV  „.n,,2  B2(un)ZiW(<) 


dt 


(40) 
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To  solve  the  equation,  we  use 


un  =  'Y^a%4>k{x), 


(41) 


(42) 


e  e 

/c 

where  {</>£  (#)}•£  are  the  eigenfunctions  of  a  Laplacian.  The  Dirichlet’s  energy  term  is  calculated  explicitly  via  (4). 
After  plugging  (41)  and  (42)  into  (40),  one  obtains 


<+1  = 


qg  +  dtibl 


1  +  2dtB1{un)\k 


(43) 


where  \k  is  the  kth  eigenvalue  of  a  Laplacian. 


4.2.2  Version  2 


Discretizing  (39)  and  calculating  the  Laplacian  implicitly,  one  obtains 


u?+1  -  u? 


=  2B1(un)(Awun+1)i  -  B2(un)\\Vwu" 


,2  B1(un)W(  <)  (E,W«))B2(«’1) 


(44) 


dt  '  '  '  '  ' "  e  e 

To  solve  the  above  equation,  we  use  the  eigendecomposition  of  a  Laplacian.  Let  un  =  ak(j)k(x),  where  {4>k{x)}k 
are  the  eigenfunctions  of  a  Laplacian.  When  calculating 

p(l,n,\ |-  n,|2  B1(un)W(un)  B2 (Un)  E,  W«) 

)\\\/wU  \\£ - - - - - 

to  find  bn  (see  (42)),  we  use  (34).  Thus,  in  this  version,  the  coefficients  bk  are  defined  as: 


(45) 


e  e 

/c  K, 

where  we  have  used  (34)  to  replace  the  Dirichlet’s  energy  term.  Here,  is  the  kth  eigenvalue  of  a  Laplacian.  After 
plugging  these  representations  into  (44),  one  obtains 


<+1  = 


al  +  dtbl 


1  -^r  2dtB1(un)\k 

Below,  we  outline  steps  of  both  versions  of  the  algorithm: 


(46) 


Modified  Cheeger  Cut  Method  (MCCM)  -  Versions  1  and  2 

Here  u  will  be  a  function  whose  thresholded  value  indicates  the  class  membership.  The  steps  of  the  algorithm  are: 

*  Organize  the  data  set  in  a  graphical  setting,  compute  eigenvalues  and  eigenvectors  of  a  chosen  Laplacian. 

*  Initialize  u° ,  calculate  a0  using  (30),  calculate  b°  using  (42)  (for  Version  1)  and  (45)  (for  Version  2). 

*  Repeat  the  following  steps  from  n  =  0  until  a  stopping  criterion  is  satisfied: 

-  For  Version  1,  calculate  an+1  using  (43).  For  Version  2,  calculate  an+1  using  (46). 

-  For  Version  1,  calculate  6n+1  using  (42).  For  Version  2,  calculate  6n+1  using  (45). 

-  Calculate  unJrl  using  (30). 

*  Threshold  the  solution  to  obtain  a  binary  answer,  which  represents  the  assigned  class. 
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5  Parametrized  Laplacian  Framework 


In  [18],  Ghosh  et  al.  formulated  a  general  Laplacian  framework  by  considering  a  dynamic  process  on  graphs: 


=  — £0. 


Here,  0  is  a  column  vector  containing  the  values  of  the  dynamic  variable  for  all  vertices,  and  £  is  a  positive  semi- 
definite  matrix  called  the  spreading  operator,  which  defines  the  dynamic  process.  The  parameterized  Laplacian  family 
is  defined  as 


L  =  T-^D'-^D'  -  BWB)D’-^T~^. 

Compared  with  traditional  Laplacian  operators,  the  parametrized  Laplacian  has  two  additional  parameters:  T  and  B. 
The  diagonal  matrix  T  controls  the  time  delay  factors,  or  local  clock  rate  of  a  random  walk,  at  each  vertex.  The  bias 
factors  form  the  other  diagonal  matrix  B.  It  changes  the  trajectory  by  giving  random  walk  targets  different  weights. 
Note  that  the  degree  matrix  D'  is  defined  in  terms  of  the  biased  weight  matrix:  d[  =  J2j[Wf]ij  =  J2j[BWB]ij, 
where  we  set  W'  =  BWB. 

To  investigate  how  different  dynamic  processes  effect  our  perceptions  of  the  graph  structure,  we  study  how  the 
results  change  when  T  and  B  vary.  Recall  in  Section  2.2,  we  defined  a  similar  family  of  Laplacians  with  the  parameter 
r  (6).  As  we  will  show  for  the  following  four  special  cases,  they  can  be  formulated  as  equivalent  operators  up  to  a 
similarity  transformation. 

*  Normalized  Laplacian.  In  the  case  when  T  =  B  =  I,  we  obtain  the  symmetric  normalized  Laplacian: 

Ls  =  D-?(D-  W)D~^. 

The  normalized  Laplacian  is  related  to  the  random  walk  Laplacian  (r  =  1)  by  a  change  of  basis:  Ls  = 

*  Scaled  Graph  Laplacian.  In  the  case  when  B  =  I  and  T  =  dmaxD -1  (where  dma;E  represents  the  maximum 
degree  in  the  degree  matrix),  the  spreading  operator  is  called  the  scaled  graph  Laplacian: 

Lsci  =  B-(D-W). 
dmax 

It  is  just  the  matrix  (6)  with  r  =  0  scaled  by  ,  1  . 

Umax 

*  Unbiased  Adjacency  Matrix.  If  we  let  B  =  D~  i  and  T  =  d'^^D'-1  (where  d \'max  represents  the  maximum 

degree  in  the  degree  matrix  associated  with  Wr  =  we  obtain  the  unbiased  adjacency  matrix: 

Lunb  =  ~  D-^WD -3)  =  ~(D'  -  W'). 

^ max  ^ max 

The  unbiased  adjacency  matrix  is  related  to  (6)  with  r  =  0  (using  a  similarity  matrix  biased  by  D~  i )  by  a  factor 

of  d • 

max 
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*  Replicator.  If  we  let  B  =  V,  where  V  is  a  diagonal  matrix  whose  elements  are  the  components  of  the  eigen¬ 
vector  corresponding  to  the  largest  eigenvalue  of  W,  and  T  =  /,  we  obtain 

Lrep  =  I~  Y~W’ 

s^max 

where  A  max  is  the  largest  eigenvalue  of  W.  The  normalized  Laplacian  is  related  to  the  max  entropy  random 
walk  by  a  change  of  basis:  Lrep  =  D'^D'^^D'  —  VWV)D'~^ ,  where  D/_1(Zy  —  VWV )  is  the  max  entropy 
random  walk. 

We  utilize  these  four  different  formulations  of  the  Laplacian  in  our  paper,  and  the  results  are  described  in  the  next 
section. 


6  Results 


In  this  section,  we  show  results  for  some  benchmark  data  sets,  and  then  show  an  application  to  hyperspectral 
imagery.  We  observe  that  the  advantage  of  these  methods  is  efficiency,  while  being  able  to  maintain  the  accuracy  of 
state-of-the-art.  Moreover,  we  note  that  all  the  algorithms  presented  in  this  paper  are  alike  in  terms  of  efficiency. 

The  energy  minimization  proceeds  until  a  steady  state  condition  is  reached.  Once  the  change  of  the  norm  of  the 
vector  field  in  subsequent  iterations  falls  below  a  threshold,  the  system  is  no  longer  evolving  and  the  energy  decrement 
is  negligible.  In  the  experiments,  obtained  on  a  2.4  GHz  Intel  Core  i2  Quad  computer,  the  calculation  is  stopped  when 


Ik 


n+ 1 


'"III* 


k”+1ll|2 


<  v, 


(47) 


where  the  value  of  y  is  dependent  on  the  data  set. 

When  choosing  a  weight  function,  the  goal  is  to  assign  a  large  weight  to  an  edge  if  the  two  vertices  it  is  connecting 
are  similar  and  a  small  weight  otherwise.  One  popular  choice  for  the  weight  function  is  the  Gaussian 


_  M(x,y)2 

w(x,  y)  =  e  v2 


(48) 


where  M(x,y)  is  some  distance  measure  between  the  two  vertices  x  and  y,  and  cr  is  a  parameter  to  be  chosen.  For 
example,  if  the  data  set  consists  of  points  in  M2,  M(x,  y)  can  be  chosen  to  be  the  Euclidean  distance  between  x  and  y, 
since  points  farther  away  are  less  likely  to  belong  to  the  same  cluster  than  points  closer  together.  For  images,  M(x,y) 
can  be  defined  as  the  weighted  L2-norm  of  the  difference  of  the  feature  vectors  of  pixels  x  and  y,  where  the  feature 
vector  of  a  pixel  can  be  defined  as  the  set  of  intensity  values  in  the  pixel’s  neighborhood,  as  described  in  [20]. 
Another  choice  for  the  weight  function  is  the  Zelnik-Manor  and  Perona  function  [31]  for  sparse  matrices: 

M(x,y )2 

w(x,  y)  =  e  v) ,  (49) 

using  r(x)  =  M(x,  z)2,  where  z  is  the  Kth  closest  vertex  to  vertex  x.  The  parameter  K  is  to  be  chosen. 

Note  that  it  is  not  necessary  or  even  desirable  to  use  a  fully  connected  graph  setting ,  which  might  be  a  computa¬ 
tional  burden.  Specifically,  the  fully  connected  graph  can  be  approximated  by  a  much  smaller  graph  by  only  including 
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Table  1 :  Modified  Cheeger  and  ratio  cut  method  results  and  comparison 


Two  moons 


MNIST 


1Y11  1  1  kJ  A 

Method 

Accuracy 

Method 

Accuracy 

symmetric  Laplacian  (MRCM-  Approach  1) 

98.35% 

symmetric  Laplacian  (MRCM-  Approach  1) 

98.62% 

symmetric  Laplacian  (MRCM-  Approach  2) 

97.80% 

symmetric  Laplacian  (MRCM-  Approach  2) 

98.27% 

symmetric  Laplacian  (MCCM) 

98.00% 

symmetric  Laplacian  (MCCM) 

98.33% 

scaled  graph  Laplacian  (MRCM-  Approach  1) 

98.35% 

scaled  graph  Laplacian  (MRCM-  Approach  1 ) 

98.45% 

scaled  graph  Laplacian  (MRCM-  Approach  2) 

97.45% 

scaled  graph  Laplacian  (MRCM-  Approach  2) 

98.12% 

scaled  graph  Laplacian  (MCCM) 

97.70% 

scaled  graph  Laplacian  (MCCM) 

98.2% 

unbiased  adjacency  matrix  (MRCM-  Approach  1 ) 

98.05% 

unbiased  adjacency  matrix  (MRCM-  Approach  1) 

98.66% 

unbiased  adjacency  matrix  (MRCM-  Approach  2) 

97.55% 

unbiased  adjacency  matrix  (MRCM-  Approach  2) 

98.26% 

unbiased  adjacency  matrix  (MCCM) 

97.70% 

unbiased  adjacency  matrix  (MCCM) 

98.32% 

replicator  (MRCM-  Approach  1 ) 

97.35% 

replicator  (MRCM-  Approach  1 ) 

96.60% 

replicator  (MRCM-  Approach  2) 

95.50% 

replicator  (MRCM-  Approach  2) 

94.38% 

replicator  (MCCM) 

95.55% 

replicator  (MCCM) 

94.14% 

method  in  [35] 

95.08% 

method  in  [3] 

98.36% 

method  in  [8] 

93.5% 

Timing  for  our  methods 

2  to  15  sec. 

method  in  [23] 

95.38% 

Timing  for  method  in  [3] 

92.2  sec. 

method  in  [5] 

91.31% 

method  in  [3] 

95.86% 

an  edge  between  vertex  x  and  y  if  x  is  among  the  ^-nearest  neighbors  of  y  or  vice  versa.  This  is  called  a  k-nearest 
neighbor  graph.  We  sparsify  the  graph  in  this  way.  One  can  also  build  a  mutual  k-nearest  neighbor  graph  by  only 
including  an  edge  between  x  and  y  if  both  of  them  are  ^-nearest  neighbors  of  each  other.  If  two  vertices  x  and  y  are 
not  connected  by  an  edge,  the  weight  between  them  is  set  to  0. 

6.1  MNIST  Data  Set 

The  MNIST  digits  data  set  [26],  available  at  http://yann.lecun.com/exdb/mnist/ ,  is  a  data  set  of  70000  28  x  28 
images  of  handwritten  digits  from  0  —  9  to  be  classified  into  ten  classes.  Since  our  method  is  only  binary,  we  obtained 
a  subset  of  this  set  to  cluster  into  two  classes;  in  particular,  we  chose  digits  4  and  9  since  these  digits  are  sometimes 
hard  to  distinguish,  if  handwritten.  This  created  a  set  of  13782  digits,  each  either  4  or  9.  We  start  with  a  random 
initialization  of  the  class,  and  the  goal  is  to  classify  each  image  into  either  a  4  or  9. 

The  MNIST  data  set  results  for  both  versions  of  the  MCRM  (both  approaches)  and  MCCM  methods  are  shown 
in  Table  1 .  Our  methods  are  italicized,  and  the  best  method  is  written  in  bold.  From  the  table,  we  see  that  all  but  the 
replicator  achieve  accuracy  in  the  98th  percentile.  Using  the  replicator  worsens  the  accuracy  by  at  least  one  percentile. 
The  results  of  MRCM  (Approach  1)  are  visualized  in  Figure  1;  those  of  other  algorithms  are  not  shown  simply  because 
they  are  very  similar.  To  compare  to  some  state-of-the-art  work  involving  the  cut  with  normalization  (i.e.  Cheeger  cut, 
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balanced  cut,  etc.)  or  spectral  computations,  we  note  the  result  of  98.36%  of  Bresson  et  al.  in  [3].  We  see  that  our 
methods  achieve  accuracy  that  is  very  similar  to  that  result,  in  addition  to  being  very  efficient  computationally.  In  fact, 
after  the  graph  was  computed,  for  all  algorithms  presented  in  this  paper,  the  minimization  took  only  around  2  to  15 
seconds,  depending  on  the  number  of  iterations.  The  mean  time  for  the  computations  recorded  in  [3]  is  92.2  seconds. 


(d)  unbiased  adjacency  matrix  (e)  replicator 

Figure  1 :  MNIST  data  set  results 


6.2  Two  Moons  Data  Set 

The  data  set  is  constructed  from  two  half  circles  in  M2  with  a  radius  of  one  with  centers  at  (0,  0)  and  (1,  0.5).  A 
thousand  uniformly  chosen  points  are  sampled  from  each  circle,  embedded  in  M100  and  i.d.d.  Gaussian  noise  with 
standard  deviation  0.02  is  added  to  each  coordinate,  giving  a  set  of  two  thousand  points.  Starting  from  some  initial 
classification  of  the  points,  the  goal  is  to  segment  the  two  half  circles.  Here,  we  use  an  initialization  involving  the 
thresholded  second  eigenvector  of  the  chosen  Laplacian. 

The  results  for  both  versions  of  the  MRCM  (both  approaches)  and  MCCM  algorithms  are  shown  in  Table  1 .  The 
table  shows  that  all  but  the  replicator  always  achieve  accuracies  in  the  97th  or  98th  percentile.  The  results  of  MRCM 
(Approach  1)  are  visualized  in  Figure  2;  those  of  the  rest  are  not  shown  because  they  are  very  similar. 

To  compare  to  some  state-of-the-art  work  involving  the  cut  with  normalization  (i.e.  Cheeger  cut,  balanced  cut, 
etc.)  or  spectral  computations,  we  note  the  95.08%  result  of  Szlam  et  al.  in  [35],  the  93.5%  result  of  Buhler  et  al. 
in  [8],  the  95.38%  result  of  Hein  et  al.  in  [23],  the  91.31%  result  of  Bresson  et  al.  in  [5]  and  the  95.86%  result  of 
Bresson  et  al.  in  [3].  We  see  that  our  method  achieves  an  accuracy  that  is  at  least  2%  higher. 
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(a)  symmetric  graph  Laplacian 


(b)  scaled  graph  Laplacian 


(c)  unbiased  adjacency  matrix  (d)  replicator 

Figure  2:  Results  on  the  two  moons  data  set 

6.3  Application  to  Hyperspectral  Imagery 

We  apply  our  method  to  hyperspectral  data;  the  goal  is  to  track  chemical  plumes  recorded  as  a  hyperspectral 
video  sequence.  The  data  [7],  provided  by  the  Applied  Physics  Laboratory  at  John  Hopkins  University,  is  a  video 
sequence  of  plumes  released  at  the  Dugway  Proving  Ground.  The  images  are  of  dimension  128  x  320  x  129,  where 
the  last  dimension  indicates  the  number  of  channels,  each  depicting  a  particular  frequency  from  7,820  nm  to  1 1,700 
nm,  spaced  30  nm  apart.  The  sets  of  images  were  taken  from  videos  captured  by  three  long  wave  infrared  (LWIR) 
spectrometers,  each  placed  at  a  different  location  about  2  km  away  from  the  release  of  plume  at  an  elevation  of  1300 
feet.  One  hyperspectral  image  is  captured  every  five  seconds.  Other  work  on  this  data  set  can  be  found  in  [17],  [36] 
and  [34].  Prior  work  on  hyperspectral  plume  detection  using  other  sensors  includes  [24]  (MWIR)  and  [28]  (HYDICE). 

Since  our  data  depends  on  time,  to  form  a  graph  on  the  set,  we  select  k  different  video  frames  and  then  concatenate 
the  points,  allowing  for  the  data  to  be  associated  over  these  k  frames.  Although  the  resulting  graph  is  very  large  (for 
7  frames,  the  full  graph  Laplacian  is  an  n  x  n  matrix  with  n  =  286,  720),  we  use  the  Nystrom  extension  method 
to  effectively  calculate  the  desired  eigenfunctions.  The  algorithm  allows  one  to  calculate  an  approximation  of  the 
eigenfunctions,  while  only  having  to  compute  a  small  portion  of  the  graph.  In  this  case  of  a  very  large  graph,  we  need 
to  use  Version  2  of  the  proposed  algorithm,  because  the  goal  is  to  not  calculate  the  Dirichlet’s  energy  terms  using  the 
explicit  formula  (4),  which  would  be  computationally  inefficient,  but  use  the  eigenfunctions  of  the  Laplacian  and  (34). 

Figure  3  shows  a  sampling  of  four  different  eigenvectors  for  a  subset  of  7  frames.  Note  that  each  eigenvector 
highlights  a  different  aspect  of  the  image.  To  produce  a  good  initialization,  we  use  an  operator  assisted  method 
involving  spectral  clustering.  In  particular,  by  thresholding  appropriately  the  values  of  eigenvectors,  one  obtains 
information  about  a  particular  class.  For  example,  as  shown  in  Figure  3,  the  third  eigenvector  highlights  the  plume. 
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(a)  second  eigenvector 


(b)  third  eigenvector 


(c)  fourth  eigenvector 


(d)  fifth  eigenvector 


Figure  3:  Eigenvectors  of  the  symmetric  normalized  graph  Laplacian  for  a  subset  of  7  video  frames  shown  in  false 
color. 

By  thresholding  its  values,  one  can  find  the  pixels  that  are  most  likely  part  of  the  plume. 

We  tested  our  method  on  several  video  frames.  The  initialization  for  7  of  the  frames  is  displayed  in  Figure  4.  The 
final  segmentation  results  for  these  frames,  after  141  iterations,  are  shown  in  Figure  5.  The  plume  is  outlined  in  dark 
red,  and  the  background  shown  in  light  green.  This  experiment  took  around  23  seconds  after  the  graph  was  computed. 

7  Conclusion 

This  paper  introduces  new  methods  that  utilize  the  graph  Laplacian  to  solve  the  fully  nonlinear  Cheeger  and 
(modified)  ratio  cut  problems,  with  all  algorithms  making  use  of  the  Ginzburg-Landau  functional.  To  make  spectral 
and  graph  computations  efficient,  fast  numerical  routines  are  used.  The  experiments,  which  include  an  application  to 
hyperspectral  data,  show  that  the  algorithms  produce  results  that  are  comparable  with  or  better  than  some  of  the  best 
methods,  and  very  efficiently.  However,  there  is  more  work  to  be  done;  we  plan  on  extending  the  two  procedures  to 
the  multiclass  case,  and  finding  other  interesting  applications. 


(e)  fifth  frame  (f)  sixth  frame  (g)  seventh  frame 

Figure  4:  Plume  data  set  initialization 


(e)  fifth  frame  (f)  sixth  frame  (g)  seventh  frame 

Figure  5 :  Plume  data  set  results 
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Appendix 


Extensions  and  Variants 


One  can  derive  similar  algorithms  using  volume  of  a  set  instead  of  its  size,  where  volume  is  defined  as  vol(S)  = 
Y xes  d(x)i  where  d(x)  is  the  degree  of  vertex  x.  In  Approach  I,  we  can  propose  a  slightly  different  problem  of 


min  cut(S ,  S) 
scv  v  ' 


/ 


1 

vol(S) 


+ 


1 

voi(sy 
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Let  us  redefine  the  mean  to  be  mean(u )  =  ^  d(x)u(x))9  and  let  £2  be  the  weighted  L2  norm  with  the 

degrees  as  the  weights.  It  can  be  easily  shown  that 


||X5  ~  mean(xs)\\p 

The  rest  of  the  derivation  proceeds  as  before,  and  we  obtain 

1 


+ 


Ci(tz) 


\\u  —  meanyu 


C2(u)(x ) 


vol(S)  vol(S) 

d(x)(u  —  mean(u )) 


||  u  —  mean(u)\\g2 


In  Approach  II,  we  can  propose  a  slightly  different  problem  of 


mm  cu\ 
Scv 


vol(S)  vol(S) 


)■ 


Now,  Ci  and  C2  would  be  the  following: 


“  Exd(x)u(x)  +  Zxd(x)(  1  -  u(x)Y 
-2  2 

2[u)  -  (Ex  <*M*))2  +  Ex  <*(*)(!  -  «(*)))2 ' 

We  have  derived  the  similar  algorithms,  but  the  results  of  the  two  slightly  different  formulations  are  very  similar  to 
those  of  the  original  versions,  and  thus  we  only  state  the  results  for  the  original  formulation. 


Calculation  of  the  Eigenvalues/Eigenvalues  of  the  Graph  Laplacian 


Our  method  involves  the  computation  of  eigenvalues  and  associated  eigenvectors  of  graph  Laplacians.  In  practice, 
one  needs  to  compute  only  a  fraction  of  the  eigenvalues  and  eigenvectors  because  only  a  small  portion  of  the  nodes 
are  significant.  When  the  graph  is  sparse  and  is  of  moderate  size,  around  5000  x  5000  or  less,  we  use  a  Rayleigh- 
Chebyshev  procedure  outlined  in  [1].  It  is  a  modification  of  an  inverse  subspace  iteration  method  that  uses  adaptively 
determined  Chebyshev  polynomials.  When  the  graph  is  very  large,  the  Nystrom  extension  method  [15, 16]  can  be 
used.  It  is  a  matrix  completion  method,  and  is  often  used  in  many  applications,  such  as  kernel  principle  component 
analysis  [14]  and  spectral  clustering  [30].  This  procedure  is  especially  efficient  because  it  uses  approximations  based 
on  calculations  on  very  small  submatrices  of  the  original  matrix;  therefore,  it  is  useful  when  the  graph  is  very  large. 

Here,  we  show  how  to  apply  the  Nystrom  extension  method  in  the  case  of  the  symmetric  Laplacian,  but  it  can  be 
easily  applied  to  other  choices.  Note  that  the  eigenvectors  of  the  matrix  W  =  D~^WD~^  are  the  same  as  those  of 
the  symmetric  graph  Laplacian,  and  their  eigenvalues  have  a  very  simple  relationship.  Below,  we  formulate  a  method 
to  calculate  the  eigenvectors  and  eigenvalues  of  W  and  thus  of  Ls . 

Let  w  be  the  weight  function,  A  be  an  eigenvalue  of  W,  and  its  associated  eigenvector.  The  Nystrom  method 
approximates  the  eigenvalue  equation 


/  w(y,x)cj)(x)dx  =  \(j)(x) 

Jn 

using  a  quadrature  rule,  a  technique  to  find  weights  Cj(y)  and  a  set  of  l  interpolation  points  X  =  {xj}  such  that 


Y ]cj(y)4>(xj)  =  /  w(y,x)<f>(x)dx  +  E(y) 
3=1  Jn 
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where  E(y)  represents  the  error  in  the  approximation.  We  use  Cj(y)  =  w(y,Xj)  and  choose  the  l  interpolation  points 
randomly  from  the  vertex  set  V.  Denote  the  set  of  l  randomly  chosen  points  by  X  =  {xi}li=1  and  its  complement  by 
Y.  Pardoning  Z  into  Z  =  X  UY  and  letting  fa(x)  be  the  the  kth  eigenvector  of  W  and  its  associated  eigenvalue, 
we  obtain  the  system  of  equations 

^2  w(yi,xj)(f)k(xj)  =  Xk(f>k(yi)  Vfcel, 

Xj£X 

One  cannot  solve  these  equations  directly  because  fa  are  unknown.  However,  the  eigenvectors  of  W  are  approximated 
using  submatrices  of  W.  Let  Wxy  be  defined  as 

w{x  1,2/1)  ...  w(xi,  yn-i) 

_  w{xuyi)  ...  w(xi,yn-t) 

The  matrices  Wyx,  Wxx  and  Wyy  are  defined  similarly.  The  matrix  W  is  then 

WXx  Wxy 
Wyx  Wyy 

We  normalize  the  above  weight  matrix  to  calculate  the  eigenvalues  and  eigenvectors  of  W.  Here,  1  k  is  the  K- 
dimensional  unit  vector.  The  matrices  dx,  dy,  Wxx  and  Wxy  are  defined  as 

dx  =  Wxx  h  +  WxYln-l, 
dy  =  Wyx  h  +  (WYxW^xWxYfan-u 
Wxx  =  Wxx-/(sxsx)i 
Wxy  =  Wxy-/{sxs^x), 

where  A./B  denotes  componentwise  division  between  matrices  A  and  B ,  vT  denotes  the  transpose  of  vector  v, 
$x  =  \[dx  and  sy  =  fady.  It  is  shown  in  [2]  that  if  Wxx  =  ByDB^,  and  if  A  and  T  are  matrices  such  that 

AtTA  =  Wxx  +  WxxWxyWyxWxx* 

then  the  eigenvector  matrix  V  consisting  of  approximations  of  L  eigenvectors  of  W,  and  thus  of  the  symmetric  graph 
Laplacian,  is 

BxD^B^AT-  § 

WYxBxD-iB^AT-i 

while  I  —  T  contains  the  corresponding  eigenvalues  of  the  symmetric  graph  Laplacian  in  its  diagonal  entries. 

One  can  see  that  the  Nystrom  extension  method  is  efficient  because  it  approximates  the  eigenvalues  and  eigenvec¬ 
tors  of  an  n  x  n  matrix  (n  is  large)  by  calculations  using  matrices  that  are  much  smaller,  having  dimension  no  larger 
than  n  x  /,  where  l  is  small.  We  also  note  that  the  method  calculates  l  eigenvalues  and  corresponding  eigenvectors;  in 
practice,  only  a  very  small  portion  (we  use  around  0.3%)  of  eigenvalues  and  eigenvectors  of  the  graph  Laplacian  are 
needed  to  obtain  an  accurate  solution. 
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